Skip to content

[PPL][analytics-engine-compatibility] Lower isempty/isblank to CHAR_LENGTH = 0 instead of multiset IS_EMPTY#5439

Open
songkant-aws wants to merge 3 commits into
opensearch-project:mainfrom
songkant-aws:ppl-isempty-uses-charlength
Open

[PPL][analytics-engine-compatibility] Lower isempty/isblank to CHAR_LENGTH = 0 instead of multiset IS_EMPTY#5439
songkant-aws wants to merge 3 commits into
opensearch-project:mainfrom
songkant-aws:ppl-isempty-uses-charlength

Conversation

@songkant-aws
Copy link
Copy Markdown
Collaborator

Description

PPL's isempty(x) / isblank(x) were lowered in PPLFuncImpTable using SqlStdOperatorTable.IS_EMPTY against string operands. That operator is the SQL:2003 multiset emptiness predicate — its OperandTypeChecker is OperandTypes.COLLECTION_OR_MAP and its enumerable-runtime binding calls java.util.Collection.isEmpty() reflectively.

Passing a string slipped through only because:

  1. RexBuilder.makeCall bypasses the operand checker.
  2. Calcite's Linq4j codegen emits target.isEmpty() (a bare method-name call), which Janino then resolves against the actual operand type. String.isEmpty() happens to exist with the same signature, so the script-path runtime works by coincidence.

This shape is fragile:

  • Outside the script-pushdown path — most importantly Substrait emitters such as the analytics-engine route — there is no isthmus mapping for SqlStdOperatorTable.IS_EMPTY, and the plan is rejected with "unresolved function reference IS EMPTY".
  • The PredicateAnalyzer had to special-case-block DSL pushdown for the OR(IS_NULL, IS_EMPTY) shape because OpenSearch's non-short-circuiting bool.should would NPE evaluating name.isEmpty() on a null operand. So the entire OR was wrapped in a single script_query — a strict pessimization compared to a mixed bool.should.

Issues Resolved

This PR replaces the lowering with the predicate that actually matches the documented semantics:

isempty(x) -> OR(IS_NULL(x), CHAR_LENGTH(x) = 0)
isblank(x) -> OR(IS_NULL(x), CHAR_LENGTH(TRIM(BOTH ' ' FROM x)) = 0)

CHAR_LENGTH is a standard string operator with explicit type semantics, backed by SqlFunctions.charLength(String) in Calcite's enumerable runtime, and round-trips through Substrait's default extension catalog. Same observable behavior on existing PPL paths; correct behavior on Substrait-based paths.

The containIsEmptyFunction guard in PredicateAnalyzer.andOr is removed: CHAR_LENGTH(null) follows Calcite's NullPolicy.STRICT and returns null rather than NPE, so DSL's non-short-circuiting should evaluation is safe even when the field is null. This also unlocks better pushdown — the IS_NULL arm is now a native bool.must_not.exists clause and only the CHAR_LENGTH=0 arm remains a script. Null docs no longer touch the script engine.

The isNullOr_ScriptPushDown test was renamed to isEmpty_pushesIsNullArmAsExistsAndCharLengthArmAsScript and rewritten to assert the actual structural shape of the pushdown (2-arm bool.should with the expected types per arm) instead of just substring-matching the script lang marker.

Plan changes are reflected in the regenerated explain golden files for both calcite/ and calcite_no_pushdown/ variants.

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • Commits are signed per the DCO using --signoff.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…_EMPTY

PPLFuncImpTable was emitting SqlStdOperatorTable.IS_EMPTY against string
operands as part of its isempty(x) / isblank(x) lowering. That operator
is the SQL:2003 multiset emptiness predicate — its OperandTypeChecker is
COLLECTION_OR_MAP and its enumerable-runtime binding calls
java.util.Collection.isEmpty() reflectively. Passing a string slipped
through only because RexBuilder.makeCall bypasses the operand checker
and Calcite's Linq4j codegen emits a bare `target.isEmpty()` that happens
to bind to String.isEmpty() at Janino compile time. Outside of the
script-pushdown path, the abuse breaks: any backend that does not run
RexNodes through Calcite's enumerable codegen — most notably Substrait
emitters such as the analytics-engine route — has no isthmus mapping
for SqlStdOperatorTable.IS_EMPTY and rejects the plan with
"unresolved function reference IS EMPTY".

Replace the lowering with the predicate that actually matches the doc:
  isempty(x) -> OR(IS_NULL(x), CHAR_LENGTH(x) = 0)
  isblank(x) -> OR(IS_NULL(x), CHAR_LENGTH(TRIM(BOTH ' ' FROM x)) = 0)

CHAR_LENGTH is a standard string operator with explicit type semantics,
backed by SqlFunctions.charLength(String) in Calcite's enumerable
runtime, and it round-trips through Substrait's default extension
catalog. Same observable behavior on existing PPL paths; correct
behavior on Substrait-based paths.

Also remove PredicateAnalyzer.containIsEmptyFunction. It existed solely
to abort DSL pushdown when it saw the OR(IS_NULL, IS_EMPTY) shape
because the SHOULD branches would NPE on null operands evaluating
.isEmpty() through the script. The new shape produces
OR(IS_NULL, CHAR_LENGTH = 0); the IS_NULL arm pushes down to
must_not.exists, the CHAR_LENGTH arm to a script that returns null
(not NPE) for null fields, and the SHOULD union answers correctly.

Test plan:
 - CalcitePPLConditionBuiltinFunctionIT.testIsEmpty / testIsBlank pass.
 - CalciteExplainIT.testExplainIsEmpty / testExplainIsBlank /
   testExplainIsEmptyOrOthers pass with regenerated golden plans for
   both pushdown-enabled and no-pushdown variants.
 - PredicateAnalyzerTest passes, including the unrelated
   equals_scriptPushDown_Struct test that still uses
   SqlStdOperatorTable.IS_EMPTY for its legitimate map-emptiness
   semantics.

Signed-off-by: Songkan Tang <songkant@amazon.com>
The previous test (isNullOr_ScriptPushDown) only asserted that the builder
string contained the opensearch_compounded_script lang marker. With the
isempty -> OR(IS_NULL, CHAR_LENGTH=0) lowering, the IS_NULL arm now pushes
down as a native bool.must_not.exists clause and only the CHAR_LENGTH=0
arm remains a script — so the substring assertion still passed without
verifying the actual structure. Renamed and rewritten to assert:

  - Top-level builder is a BoolQueryBuilder with exactly two should arms
    and no must / top-level must_not clauses.
  - Arm 0 is a BoolQueryBuilder whose mustNot wraps an ExistsQueryBuilder
    (the IS_NULL lowering).
  - Arm 1 is a ScriptQueryBuilder using the opensearch_compounded_script
    lang (the CHAR_LENGTH=0 lowering).

Also documents inline why the prior containIsEmptyFunction NPE-prevention
detector is no longer needed: CHAR_LENGTH(null) follows Calcite's STRICT
null policy and returns null instead of NPE, so DSL's non-short-circuiting
should evaluation is safe.

Signed-off-by: Songkan Tang <songkant@amazon.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 14, 2026

PR Reviewer Guide 🔍

(Review updated until commit 217d477)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ No major issues detected

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 14, 2026

PR Code Suggestions ✨

Latest suggestions up to 217d477
Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
General
Verify NULL handling in CHAR_LENGTH

The CHAR_LENGTH function may return NULL when arg is NULL, which could cause the
equality check to behave unexpectedly in some SQL dialects. Consider verifying that
the IS_NULL arm of the OR properly short-circuits evaluation to avoid potential
issues with NULL propagation in the equality comparison.

core/src/main/java/org/opensearch/sql/expression/function/PPLFuncImpTable.java [1283-1286]

+builder.makeCall(
+    SqlStdOperatorTable.EQUALS,
+    builder.makeCall(SqlStdOperatorTable.CHAR_LENGTH, arg),
+    builder.makeExactLiteral(BigDecimal.ZERO))
 
-
Suggestion importance[1-10]: 6

__

Why: The suggestion correctly identifies that CHAR_LENGTH(null) returns null, but the PR's extensive comment (lines 1263-1275) and test code (lines 850-855) already explain that Calcite's NullPolicy.STRICT ensures null-safe behavior. The OR structure with IS_NULL as the first arm handles null inputs correctly. The suggestion asks for verification rather than identifying an actual issue, and the existing_code equals improved_code.

Low

Previous suggestions

Suggestions up to commit 7177ad6
CategorySuggestion                                                                                                                                    Impact
General
Verify null handling in CHAR_LENGTH

The CHAR_LENGTH function may return NULL when arg is NULL, which could cause the
equality check to behave unexpectedly in some SQL dialects. Consider verifying that
the IS_NULL check in the outer OR properly short-circuits evaluation, or add
explicit null handling if the backend doesn't guarantee short-circuit semantics.

core/src/main/java/org/opensearch/sql/expression/function/PPLFuncImpTable.java [1283-1286]

+builder.makeCall(
+    SqlStdOperatorTable.EQUALS,
+    builder.makeCall(SqlStdOperatorTable.CHAR_LENGTH, arg),
+    builder.makeExactLiteral(BigDecimal.ZERO))
 
-
Suggestion importance[1-10]: 5

__

Why: The suggestion correctly identifies that CHAR_LENGTH(null) returns null, but the concern is addressed by the extensive comment (lines 1263-1275) explaining that Calcite's NullPolicy.STRICT ensures null-safe evaluation. The OR with IS_NULL handles null cases explicitly, making the behavior correct. The suggestion asks for verification rather than identifying an actual issue.

Low

@songkant-aws songkant-aws added PPL Piped processing language pushdown pushdown related issues enhancement New feature or request labels May 14, 2026
Signed-off-by: Songkan Tang <songkant@amazon.com>
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 217d477

@songkant-aws songkant-aws changed the title [PPL] Lower isempty/isblank to CHAR_LENGTH = 0 instead of multiset IS_EMPTY [PPL][analytics-engine-compatibility] Lower isempty/isblank to CHAR_LENGTH = 0 instead of multiset IS_EMPTY May 14, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request PPL Piped processing language pushdown pushdown related issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant